[Core] fix gRPC handlers' unlimited active calls configuration #25626
Conversation
src/ray/rpc/grpc_server.cc (outdated)

-  int buffer_size = 100;
+  // When there is no max active RPC limit, a call will be added to the completion
+  // queue before processing starts, so adding only 1 call is enough.
+  int buffer_size = 1;
I'm not very familiar with this, but does this mean we'll accept requests more slowly than before? For example, accepting requests one by one, because the buffer size is only 1 and a new call is added only after we do the actual processing.
The nightly tests do not show a big difference, but from reading the completion queue code, I can see that having a number of tags buffered can make processing more efficient. I adjusted the value to be closer to the existing one.
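To make the buffering concrete, here is a minimal self-contained sketch of the idea, not Ray's actual code; `ServerCallFactory`, `CreateCall`, and `PreBufferCalls` are illustrative stand-ins for the real completion-queue wiring:

```cpp
#include <iostream>

// Illustrative stand-in: in the real server, this registers one pending
// ServerCall as a tag on the gRPC completion queue via the generated
// async request API.
struct ServerCallFactory {
  void CreateCall() const { std::cout << "pending ServerCall tag posted\n"; }
};

// Pre-buffer request tags so the polling thread always has calls ready.
// With a finite max_active_rpcs, buffer that many; with no limit (-1),
// a small buffer suffices because the fixed server creates a replacement
// call as soon as each tag is received.
void PreBufferCalls(const ServerCallFactory &factory, int max_active_rpcs) {
  const int buffer_size = (max_active_rpcs == -1) ? 1 : max_active_rpcs;
  for (int i = 0; i < buffer_size; ++i) {
    factory.CreateCall();
  }
}

int main() {
  ServerCallFactory factory;
  PreBufferCalls(factory, /*max_active_rpcs=*/-1);  // unlimited mode: 1 buffered tag
}
```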
* master: (104 commits)
  * [Serve] Java Client API and End to End Tests (ray-project#22726)
  * [Docs] Small fix to AIR examples descriptions (ray-project#26227)
  * [Deployment Graph] Move `Deployment` creation outside to build function (ray-project#26129)
  * [K8s][Ray Operator] Ignore resource requests when detected container resources. (ray-project#26234)
  * Revert "[Core] Add retry exception allowlist for user-defined filteri… (ray-project#26289)
  * [ci] pin gpustat (ray-project#26311)
  * [tune] fix `set_tune_experiment` (ray-project#26298)
  * Revert "Revert "[AIR][Serve] Rename ModelWrapperDeployment -> PredictorDeployment"" (ray-project#26231)
  * [Release] Use nightly base images for release tests (ray-project#25373)
  * Revert "[Core] fix gRPC handlers' unlimited active calls configuration (ray-project#25626)" (ray-project#26202)
  * [RLlib] Some Docs fixes (2). (ray-project#26265)
  * [C++ worker] Refine worker context and more (ray-project#26281)
  * Fix file_system_monitor.cc message (ray-project#26143)
  * [Java] Make Java test more stable (ray-project#26282)
  * [air] Do not warn of `checkpoint_dir` if it's coming from us (base_trainer). (ray-project#26259)
  * [Datasets] Support drop_columns API (ray-project#26200)
  * [Datasets] Fix max number of actors for default actor pool strategy (ray-project#26266)
  * [ci] Stop syncer staging tests (ray-project#26273)
  * [core][gcs] Add storage namespace to redis storage in GCS. (ray-project#25994)
  * [workflow] Deprecate workflow.create (ray-project#26106)
  * ...
Why are these changes needed?
Ray's gRPC server wrapper configures a max active call setting for each handler. When the max active call is `-1`, the handler is supposed to handle an unlimited number of requests concurrently. However, in practice it is often observed that handlers configured with unlimited active calls still handle at most 100 requests concurrently. This is a result of the existing logic: the server buffers a fixed number of `ServerCall` objects (tags) in the completion queue before actual processing. The problem is that new `ServerCall`s are created on the event loop instead of the gRPC server thread. When the event loop runs a callback from the gRPC server, the callback creates a new `ServerCall` object, and can run the gRPC handler to completion if the handler does not have any async step. So overall, the event loop will never run more callbacks than the initial number of `ServerCall`s, which is 100 in the "unlimited" mode.

The solution is to create the new `ServerCall` on the gRPC server thread, before sending the received `ServerCall` to the event loop (a simplified sketch of the two flows follows below).

Running some nightly tests to verify the fix does not introduce instabilities: https://buildkite.com/ray-project/release-tests-branch/builds/652

Also looking into adding gRPC server / client stress tests with a large number of concurrent requests.
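For illustration, here is a simplified, self-contained model of the two dispatch paths. This is not the actual Ray source; `EventLoop`, `DispatchOld`, and `DispatchNew` are assumed names, and the event loop runs callbacks inline only for demonstration:

```cpp
#include <functional>
#include <iostream>

// Minimal stand-ins for the real types; all names here are illustrative.
struct ServerCall {
  void HandleRequest() { std::cout << "handling request\n"; }
};

struct ServerCallFactory {
  // In the real server this posts a new pending call (tag) on the gRPC
  // completion queue; here it just stands in for that side effect.
  void CreateCall() const { std::cout << "new pending ServerCall posted\n"; }
};

struct EventLoop {
  // Real event loops queue the callback; this sketch runs it inline.
  void Post(std::function<void()> cb) { cb(); }
};

// Buggy flow: the replacement ServerCall is created only once the event
// loop runs the callback. If the handler runs to completion synchronously,
// the loop can never have more callbacks in flight than the initial batch
// of buffered ServerCalls (100 in the old "unlimited" mode).
void DispatchOld(ServerCall *call, EventLoop &loop,
                 const ServerCallFactory &factory) {
  loop.Post([call, &factory] {
    factory.CreateCall();
    call->HandleRequest();
  });
}

// Fixed flow: the gRPC server polling thread creates the replacement call
// before handing the received call to the event loop, so the completion
// queue always has a fresh tag regardless of how long handlers take.
void DispatchNew(ServerCall *call, EventLoop &loop,
                 const ServerCallFactory &factory) {
  factory.CreateCall();
  loop.Post([call] { call->HandleRequest(); });
}

int main() {
  ServerCall call;
  ServerCallFactory factory;
  EventLoop loop;
  DispatchOld(&call, loop, factory);
  DispatchNew(&call, loop, factory);
}
```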
Related issue number
Checks
- I've run `scripts/format.sh` to lint the changes in this PR.